Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ retry connecting the cloudevents client #280

Conversation

skeeey
Copy link
Member

@skeeey skeeey commented Sep 15, 2023

Summary

Retry connecting the cloudevents client when the client has network error

@openshift-ci openshift-ci bot requested review from deads2k and mdelder September 15, 2023 01:36
@skeeey
Copy link
Member Author

skeeey commented Sep 15, 2023

/assign @qiujian16

@skeeey
Copy link
Member Author

skeeey commented Sep 15, 2023

/hold

@skeeey skeeey changed the title ✨ expose the network error hook WIP ✨ expose the network error hook Sep 18, 2023
@skeeey skeeey force-pushed the cloudevents-connecterr branch from 344c2bf to 89b4d02 Compare September 19, 2023 10:05
@skeeey skeeey changed the title WIP ✨ expose the network error hook WIP ✨ retry connecting the cloudevents client Sep 19, 2023
@skeeey skeeey force-pushed the cloudevents-connecterr branch 2 times, most recently from 2ac7fb0 to f27e8ba Compare September 20, 2023 03:09
@skeeey skeeey changed the title WIP ✨ retry connecting the cloudevents client ✨ retry connecting the cloudevents client Sep 20, 2023
@skeeey
Copy link
Member Author

skeeey commented Sep 20, 2023

/unhold

// TODO enhance the cloudevents SKD to avoid wrapping the error type, then we can only handle the
// net connection errors here
klog.Errorf("failed to reconnect, %v", err)
<-connBackoffManager.Backoff().C()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it goes into this branch, will it hang on the select in the next loop?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the select in the next loop need to wait the backoff, the backoff is a blocking action

if receiverCancel != nil {
receiverCancel()
}
retrying = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just set c.cloudEventsClient = nil here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the cloudEventsClient is also used by send method, if we set it to nil here, the send may meet the nil pointer, so I did not set it to nil, what's your suggestion?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even though, the send will fail right? I think there might need a rwlock in send otherwise there is always a change that the two threads race.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean after backoff it will go to select in the next loop while client is not instantiated, and it will not have a chance to receive message from c.cloudEventsErrChan

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, the send will be fail, we maybe need a lock here for the client

the c.cloudEventsErrChan will be only send by a running client, so if the client is retrying, the select will not be blocked the next retry

@skeeey skeeey force-pushed the cloudevents-connecterr branch 2 times, most recently from b58b1c2 to d9de5d7 Compare September 21, 2023 07:53

for {
if cloudEventsClient == nil {
klog.V(4).Infof("reconnecting the cloudevents client")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one thing missing is: what if the client just send but not subscribe? Do we alway assume that the client will send and subscribe together?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm... I'm not sure, maybe we cannot assume this, in our case, it seems that for our case the send and sub are always together, but as a lib, we don't stop to separate the send and subscribe, user can send in one process, and sub in other process, maybe we need to have one retry mechanism for the client

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

separate the reconnect part from the subscription, PTAL

@skeeey skeeey force-pushed the cloudevents-connecterr branch 3 times, most recently from 264e03f to 4c4356b Compare September 22, 2023 09:27

type receiveFn func(ctx context.Context, evt cloudevents.Event)

type baseClient struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we have a separated Connect(ctx context.Context) in baseClient? so we do not start goroutine during new?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use connect method instead of new method

@skeeey skeeey force-pushed the cloudevents-connecterr branch from 4c4356b to d80ec98 Compare September 25, 2023 05:12
Copy link
Member

@qiujian16 qiujian16 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Sep 25, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: qiujian16, skeeey

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@qiujian16
Copy link
Member

Maybe @yanmxa or @morvencao could also take a look

// make sure the current client is the newest
c.RLock()
cloudEventsClient = c.cloudEventsClient
c.RUnlock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we still need to lock it here? since it is locked at the beginning of the subscribe function

Copy link
Member Author

@skeeey skeeey Sep 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, we need it, the beginning lock is for the subscribe function (it ensure there is only one subscription go routine running), this lock is in a independent go routine (the subscribe lock will be released after this go routine starts)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

emm, right, thanks for the explanation.

@morvencao
Copy link
Member

looks good to me, leave this to @yanmxa

@yanmxa
Copy link
Member

yanmxa commented Sep 27, 2023

Reference: cloudevents/sdk-go#940

/LGTM

@openshift-ci openshift-ci bot added the lgtm label Sep 27, 2023
@openshift-merge-robot openshift-merge-robot merged commit c2b66fa into open-cluster-management-io:main Sep 27, 2023
14 checks passed
@skeeey skeeey deleted the cloudevents-connecterr branch September 27, 2023 04:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants